
Conversation

@oneby-wang
Contributor

@oneby-wang oneby-wang commented Dec 21, 2025

Fixes #25092

Motivation

The ManagedCursorTest.asyncMarkDeleteBlocking() test became very flaky after PR #25087; this PR fixes it.

Modifications

  1. While debugging locally, I saw many logs like this:
org.apache.bookkeeper.mledger.ManagedLedgerException: org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$MarkDeletingMarkedPosition: Mark deleting an already mark-deleted position. Current mark-delete: 13:9 -- attempted mark delete: 13:8
Caused by: org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$MarkDeletingMarkedPosition: Mark deleting an already mark-deleted position. Current mark-delete: 13:9 -- attempted mark delete: 13:8
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setAcknowledgedPosition(ManagedCursorImpl.java:2076)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.asyncMarkDelete(ManagedCursorImpl.java:2212)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.maybeUpdateCursorBeforeTrimmingConsumedLedger(ManagedLedgerImpl.java:2736)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl$14.operationComplete(ManagedLedgerImpl.java:1768)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl$14.operationComplete(ManagedLedgerImpl.java:1747)
	at org.apache.bookkeeper.mledger.impl.MetaStoreImpl.lambda$asyncUpdateLedgerIds$7(MetaStoreImpl.java:187)

So I changed if (!lastAckedPosition.equals(cursor.getMarkDeletedPosition())) to if (lastAckedPosition.compareTo(cursor.getMarkDeletedPosition()) > 0) in ManagedLedgerImpl.
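The intent of the changed guard can be illustrated with a minimal sketch. The Pos record below is a simplified, hypothetical stand-in for the real Position class, only to show why a strict compareTo check is the right condition:

```java
// Hypothetical stand-in for a mark-delete position; not the real
// org.apache.bookkeeper.mledger.Position class.
record Pos(long ledgerId, long entryId) implements Comparable<Pos> {
    @Override
    public int compareTo(Pos o) {
        int c = Long.compare(ledgerId, o.ledgerId);
        return c != 0 ? c : Long.compare(entryId, o.entryId);
    }
}

public class MarkDeleteGuard {
    // Old guard: attempts the mark-delete whenever the positions differ,
    // so an *older* position (13:8 vs. current 13:9) is still attempted
    // and fails with MarkDeletingMarkedPosition.
    static boolean oldGuard(Pos lastAcked, Pos markDeleted) {
        return !lastAcked.equals(markDeleted);
    }

    // New guard: only attempt the mark-delete when lastAcked is strictly
    // ahead of the cursor's current mark-delete position.
    static boolean newGuard(Pos lastAcked, Pos markDeleted) {
        return lastAcked.compareTo(markDeleted) > 0;
    }

    public static void main(String[] args) {
        Pos current = new Pos(13, 9); // cursor's current mark-delete position
        Pos stale = new Pos(13, 8);   // older position seen on the trim path
        System.out.println(oldGuard(stale, current)); // true  -> would be attempted
        System.out.println(newGuard(stale, current)); // false -> skipped
    }
}
```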

  2. Fix some typos and add some TODOs (which can be deleted if not needed) in ManagedLedgerImpl.

  3. Add the simplest test case method, asyncMarkDeleteBlockingWithOneShot(); see the method comments. I moved c1.asyncMarkDelete out of the asyncAddEntry#addComplete callback, because the two operations compete for the same ledgerId due to a race condition, which is why the asyncMarkDeleteBlocking() test became more flaky after [fix][broker] Fix cursor position persistence in ledger trimming #25087. Here are some logs from when c1.asyncMarkDelete stays inside the asyncAddEntry#addComplete callback:

2025-12-19T21:58:48,996 - INFO  - [main:ManagedCursorTest@1405] - size: 100, positions: [3:0, 3:1, 3:2, 3:3, 3:4, 3:5, 3:6, 3:7, 3:8, 3:9, 5:0, 5:1, 5:2, 5:3, 5:4, 5:5, 5:6, 5:7, 5:8, 5:9, 7:0, 7:1, 7:2, 7:3, 7:4, 7:5, 7:6, 7:7, 7:8, 7:9, 9:0, 9:1, 9:2, 9:3, 9:4, 9:5, 9:6, 9:7, 9:8, 9:9, 11:0, 11:1, 11:2, 11:3, 11:4, 11:5, 11:6, 11:7, 11:8, 11:9, 13:0, 13:1, 13:2, 13:3, 13:4, 13:5, 13:6, 13:7, 13:8, 13:9, 14:0, 14:1, 14:2, 14:3, 14:4, 14:5, 14:6, 14:7, 14:8, 14:9, 16:0, 16:1, 16:2, 16:3, 16:4, 16:5, 16:6, 16:7, 16:8, 16:9, 18:0, 18:1, 18:2, 18:3, 18:4, 18:5, 18:6, 18:7, 18:8, 18:9, 19:0, 19:1, 19:2, 19:3, 19:4, 19:5, 19:6, 19:7, 19:8, 19:9]

2025-12-19T21:58:48,625 - INFO  - [main:ManagedCursorTest@1405] - size: 100, positions: [3:0, 3:1, 3:2, 3:3, 3:4, 3:5, 3:6, 3:7, 3:8, 3:9, 5:0, 5:1, 5:2, 5:3, 5:4, 5:5, 5:6, 5:7, 5:8, 5:9, 7:0, 7:1, 7:2, 7:3, 7:4, 7:5, 7:6, 7:7, 7:8, 7:9, 8:0, 8:1, 8:2, 8:3, 8:4, 8:5, 8:6, 8:7, 8:8, 8:9, 10:0, 10:1, 10:2, 10:3, 10:4, 10:5, 10:6, 10:7, 10:8, 10:9, 11:0, 11:1, 11:2, 11:3, 11:4, 11:5, 11:6, 11:7, 11:8, 11:9, 13:0, 13:1, 13:2, 13:3, 13:4, 13:5, 13:6, 13:7, 13:8, 13:9, 15:0, 15:1, 15:2, 15:3, 15:4, 15:5, 15:6, 15:7, 15:8, 15:9, 17:0, 17:1, 17:2, 17:3, 17:4, 17:5, 17:6, 17:7, 17:8, 17:9, 19:0, 19:1, 19:2, 19:3, 19:4, 19:5, 19:6, 19:7, 19:8, 19:9]
  4. Add the asyncMarkDeleteBlockingWithMultiShots() test to reproduce the race condition between c1.asyncMarkDelete and asyncAddEntry#addComplete; see the method comments.

  5. Fix the asyncMarkDeleteBlocking() flaky test; see the method comments.
    a. Fix MetadataStoreException$BadVersionException caused by a race condition between ledger rollover and ledger recovery read.
    b. Fix an assertion failure caused by a race condition between c1.asyncMarkDelete and asyncAddEntry#addComplete.

  6. Add more tests; see the PR discussion.
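The restructuring described in item 3 can be sketched as follows. This is a simplified, hypothetical test skeleton: the executor and the Pos record stand in for the real managed-ledger API, and lastPositionAfterAdds is a made-up name for the pattern of waiting for all addComplete callbacks before issuing a single mark-delete from the test thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MarkDeleteOutsideCallback {
    record Pos(long ledgerId, long entryId) {}

    // Run all simulated asyncAddEntry calls, wait for every addComplete,
    // and only then pick the position to mark-delete on the caller thread,
    // so the mark-delete cannot interleave with addComplete on the same ledger.
    static Pos lastPositionAfterAdds(int entries) {
        ExecutorService bk = Executors.newSingleThreadExecutor();
        try {
            List<CompletableFuture<Pos>> adds = new ArrayList<>();
            for (int i = 0; i < entries; i++) {
                long entryId = i;
                // hypothetical stand-in for asyncAddEntry#addComplete
                adds.add(CompletableFuture.supplyAsync(() -> new Pos(3, entryId), bk));
            }
            CompletableFuture.allOf(adds.toArray(CompletableFuture[]::new)).join();
            // the mark-delete would happen here, after all adds completed
            return adds.get(entries - 1).join();
        } finally {
            bk.shutdown();
        }
    }

    public static void main(String[] args) {
        Pos last = lastPositionAfterAdds(10);
        System.out.println("mark-delete " + last.ledgerId() + ":" + last.entryId());
    }
}
```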

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: oneby-wang#16

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Dec 21, 2025
@oneby-wang oneby-wang force-pushed the ManagedCursorTest_asyncMarkDeleteBlocking_flaky_test branch from 2374c98 to 12c1124 on December 22, 2025 08:32
Member

@lhotari lhotari left a comment


Check the comments about preferring AssertJ

@lhotari
Member

lhotari commented Dec 23, 2025

@oneby-wang There seem to be a few more locations where the same flakiness pattern occurs. Please check at least cursorPersistenceAsyncMarkDeleteSameThread. testLazyCursorLedgerCreationForSubscriptionCreation appeared flaky too, and it would be worth checking whether it's the same problem.
Please also check NonDurableCursorTest.asyncMarkDeleteBlocking.

@oneby-wang
Contributor Author

Got the two comments above.

@oneby-wang
Contributor Author

@lhotari Maybe we should do an asyncMarkDelete when we reach this condition, since this is the root cause of the inconsistency.

if (ledgerInfo != null && ledgerInfo.getLedgerId() != currentLedger.getId()
        && ledgerInfo.getEntries() == slowestReaderPosition.getEntryId() + 1) {
    slowestReaderLedgerId = slowestReaderPosition.getLedgerId() + 1;

@lhotari
Member

lhotari commented Dec 23, 2025

@lhotari Maybe we should do an asyncMarkDelete when we reach this condition, since this is the root cause of the inconsistency.

if (ledgerInfo != null && ledgerInfo.getLedgerId() != currentLedger.getId()
        && ledgerInfo.getEntries() == slowestReaderPosition.getEntryId() + 1) {
    slowestReaderLedgerId = slowestReaderPosition.getLedgerId() + 1;

@oneby-wang I guess that could make sense; however, the problem is that there could be multiple cursors at the slowest position. ManagedCursorContainerImpl/ManagedCursorContainer doesn't have a sorted data structure that could be iterated. ActiveManagedCursorContainer has the method getNumberOfCursorsAtSamePositionOrBefore, which solves this detail. It's currently used for activeCursors in ManagedLedgerImpl when cacheEvictionByExpectedReadCount is enabled, but it would be possible to use it for other use cases as well. The performance of ActiveManagedCursorContainerImpl is also better than ManagedCursorImpl for the ordinary use cases. There are micro benchmarks in the microbench module.
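The "multiple cursors at the slowest position" concern can be illustrated with a small standalone sketch (plain Java, not the real container classes; the Cursor record and cursorsAtOrBefore helper are made up for illustration):

```java
import java.util.List;

// Illustration only: with several cursors parked at the same slowest
// position, advancing just one of them is not enough before trimming.
// The real ActiveManagedCursorContainer exposes
// getNumberOfCursorsAtSamePositionOrBefore for this; here we count by hand.
public class SlowestCursors {
    record Cursor(String name, long ledgerId, long entryId) {}

    static long cursorsAtOrBefore(List<Cursor> cursors, long ledgerId, long entryId) {
        return cursors.stream()
                .filter(c -> c.ledgerId() < ledgerId
                        || (c.ledgerId() == ledgerId && c.entryId() <= entryId))
                .count();
    }

    public static void main(String[] args) {
        List<Cursor> cursors = List.of(
                new Cursor("c1", 12, 9),
                new Cursor("c2", 12, 9),  // same slowest position as c1
                new Cursor("c3", 14, 0));
        // two cursors sit at or before 12:9, so both would need advancing
        System.out.println(cursorsAtOrBefore(cursors, 12, 9)); // 2
    }
}
```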

However, it might be better to handle it separately. Just wondering if it's already handled by the advanceCursorsIfNecessary method:

try {
    advanceCursorsIfNecessary(ledgersToDelete);
} catch (LedgerNotExistException e) {
    log.info("First non deleted Ledger is not found, stop trimming");
    metadataMutex.unlock();
    trimmerMutex.unlock();
    return;
}

@oneby-wang
Contributor Author

oneby-wang commented Dec 23, 2025

Just wondering if it's already handled by the advanceCursorsIfNecessary method:

I read this code. If we mark-delete 12:9 and the next ledger is an empty one (13:-1), it seems this method will only move the mark-delete position to 12:8. If ledger 12 is then trimmed, that causes an inconsistency.

I'm wondering whether a mark-delete position with a -1 entryId is a valid mark-delete position? Some code seems to intentionally avoid it.

// Just ack messages like a consumer. Normally, consumers will not confirm a position that does not exist, so
// find the latest existing position to ack.
Position highestPositionToDelete = calculateLastEntryInLedgerList(ledgersToDelete);

// bypass empty ledgers, find last ledger with Message if possible.
while (lastConfirmedEntry.getEntryId() == -1) {
    Map.Entry<Long, LedgerInfo> formerLedger = ledgers.lowerEntry(lastConfirmedEntry.getLedgerId());
    if (formerLedger != null) {
        LedgerInfo ledgerInfo = formerLedger.getValue();
        lastConfirmedEntry =
                PositionFactory.create(ledgerInfo.getLedgerId(), ledgerInfo.getEntries() - 1);
    } else {
        break;
    }
}

EDIT: if a cursor subscribes to an empty managed ledger, its mark-delete position starts at ledgerId:-1, so that is a valid mark-delete position.
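The bypass loop quoted above walks backward over empty ledgers; the same idea can be shown as a standalone sketch, with a plain TreeMap of ledgerId to entry count standing in for the managed ledger's ledgers map:

```java
import java.util.Map;
import java.util.TreeMap;

// Standalone sketch of the quoted bypass loop: starting from a position with
// entryId == -1 (an empty ledger), walk back to the last ledger that actually
// contains entries. TreeMap stands in for the real ledgers metadata map.
public class BypassEmptyLedgers {
    record Pos(long ledgerId, long entryId) {}

    static Pos bypass(TreeMap<Long, Long> entriesPerLedger, Pos lastConfirmed) {
        while (lastConfirmed.entryId() == -1) {
            Map.Entry<Long, Long> former = entriesPerLedger.lowerEntry(lastConfirmed.ledgerId());
            if (former == null) {
                break; // no earlier ledger, e.g. a cursor on an empty managed ledger
            }
            // last entry of the former ledger is (entries - 1)
            lastConfirmed = new Pos(former.getKey(), former.getValue() - 1);
        }
        return lastConfirmed;
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> entriesPerLedger = new TreeMap<>();
        entriesPerLedger.put(12L, 10L); // ledger 12 holds entries 12:0..12:9
        entriesPerLedger.put(13L, 0L);  // ledger 13 is empty
        Pos result = bypass(entriesPerLedger, new Pos(13, -1));
        System.out.println(result.ledgerId() + ":" + result.entryId()); // 12:9
    }
}
```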

@lhotari
Member

lhotari commented Dec 23, 2025

There seems to be a few more locations where the same flakiness pattern occurs. please check at least cursorPersistenceAsyncMarkDeleteSameThread. testLazyCursorLedgerCreationForSubscriptionCreation appeared flaky too and it would be worth checking whether it's the same problem.
Please also check NonDurableCursorTest.asyncMarkDeleteBlocking.

@oneby-wang Please address this so that we can get at least some improvements to the flakiness. The NonDurableCursorTest.asyncMarkDeleteBlocking method is really flaky too. I'd like to merge this PR very soon, even though we haven't addressed all the details that have come up during the review.

@oneby-wang
Contributor Author

@lhotari Working on it.

@lhotari
Member

lhotari commented Dec 23, 2025

I read this code. If we mark-delete 12:9 and the next ledger is an empty one (13:-1), it seems this method will only move the mark-delete position to 12:8. If ledger 12 is then trimmed, that causes an inconsistency.

I'm wondering whether a mark-delete position with a -1 entryId is a valid mark-delete position? Some code seems to intentionally avoid it.

Good observations. It seems there are a lot of code locations where it's hard to determine the correct behavior.
There seems to be some inconsistency in how things are handled across different parts of the ManagedLedger implementation. I guess we can focus on addressing the test flakiness.

@oneby-wang
Contributor Author

I think maybe we can fix the cursor position and ledger inconsistency via the advanceCursorsIfNecessary method.

Member

@lhotari lhotari left a comment


LGTM. However, it seems that BadVersionException could occur in all cases where the ledger is opened by a different instance without closing the previous one. I think a retry would be needed to address those cases. We could handle that in a separate PR, since this PR will already significantly improve the current CI flakiness.

Thanks for the great work, @oneby-wang

@oneby-wang oneby-wang changed the title [fix][test] Fix ManagedCursorTest.asyncMarkDeleteBlocking() flaky test [fix][test] Fix ManagedCursorTest and NonDurableCursorTest flaky tests Dec 23, 2025
@oneby-wang
Contributor Author

I’ll revisit the BadVersionExceptions over the next few days and submit a PR to address them. Thanks for reviewing and for your great suggestions. @lhotari

@oneby-wang
Contributor Author

however it seems that BadVersionException could occur in all cases where the ledger is opened with a different instance without closing the previous one.

I think it only happens in the rollover-and-reopen race condition case. If the ledger is not rolling over, the metastore is not updated, so reopening the ledger will succeed in updating the metastore without throwing BadVersionException.
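The rollover/reopen race can be sketched as an optimistic versioned update. This is a minimal, hypothetical stand-in for the metastore's compare-and-set semantics, not the real MetaStore API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal stand-in for a versioned metastore: each successful write bumps
// the version, and a writer holding a stale version fails the way
// BadVersionException does. A rollover that updates the metadata between a
// reopening instance's read and write makes the reopening write fail.
public class VersionedMetaStore {
    static class BadVersionException extends Exception {}

    private final AtomicLong version = new AtomicLong(0);

    long read() {
        return version.get();
    }

    void update(long expectedVersion) throws BadVersionException {
        // compare-and-set: succeed only when the caller saw the latest version
        if (!version.compareAndSet(expectedVersion, expectedVersion + 1)) {
            throw new BadVersionException();
        }
    }

    public static void main(String[] args) throws Exception {
        VersionedMetaStore store = new VersionedMetaStore();
        long reopenView = store.read();  // reopening instance reads version 0
        store.update(store.read());      // rollover on the old instance: 0 -> 1
        try {
            store.update(reopenView);    // stale version 0 -> fails
        } catch (BadVersionException e) {
            System.out.println("BadVersionException on stale update");
        }
    }
}
```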

@lhotari lhotari merged commit 9c5e1c3 into apache:master Dec 23, 2025
101 of 105 checks passed
@lhotari lhotari added this to the 4.2.0 milestone Dec 23, 2025
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.49%. Comparing base (621b41f) to head (a1ef783).
⚠️ Report is 4 commits behind head on master.


@@              Coverage Diff              @@
##             master   #25101       +/-   ##
=============================================
+ Coverage     39.14%   74.49%   +35.35%     
- Complexity    13441    33679    +20238     
=============================================
  Files          1842     1899       +57     
  Lines        145491   149655     +4164     
  Branches      16907    17393      +486     
=============================================
+ Hits          56949   111482    +54533     
+ Misses        80889    29311    -51578     
- Partials       7653     8862     +1209     
Flag       Coverage Δ
inttests   26.60% <ø> (+0.02%) ⬆️
systests   23.02% <ø> (-0.07%) ⬇️
unittests  74.03% <ø> (+38.98%) ⬆️

Flags with carried forward coverage won't be shown.
see 1406 files with indirect coverage changes



Development

Successfully merging this pull request may close these issues.

Flaky-test: ManagedCursorTest.asyncMarkDeleteBlocking
